MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025
MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025jmestwa-coder wants to merge 2 commits into
Conversation
| break; | ||
| case Type::BYTE_ARRAY: | ||
| case Type::FIXED_LEN_BYTE_ARRAY: | ||
| if (logical_type != nullptr && logical_type->is_float16()) { |
There was a problem hiding this comment.
Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?
There was a problem hiding this comment.
FormatStatValue doesn't receive the column's type_length, so it can't validate the full FLBA width here. The only fixed-size read for FLBA is the 2-byte float16 load, which is what this case guards. The decimal, string and hex paths all key off val.size() and stay in bounds.
There was a problem hiding this comment.
I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.
There was a problem hiding this comment.
BTW, we need to create an issue for changes like this.
| case Type::INT96: | ||
| required = 3 * sizeof(int32_t); | ||
| break; | ||
| case Type::BYTE_ARRAY: |
There was a problem hiding this comment.
It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.
There was a problem hiding this comment.
Good point. Moved BYTE_ARRAY down with UNDEFINED since it can't carry a float16 logical type and has no fixed-width load here anyway.
| default: | ||
| break; | ||
| } | ||
| if (val.size() < required) { |
There was a problem hiding this comment.
I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.
There was a problem hiding this comment.
Fair, it's mostly the debug/printer path. The thing is val is taken verbatim from the file's Thrift stats, so a truncated min/max makes the BOOLEAN/INT96/numeric memcpys read past the end of the buffer even there. The guard is just a size check in front of the existing loads, so I tried to keep it minimal. Happy to trim it further if it still feels too heavy.
FormatStatValue in cpp/src/parquet/types.cc memcpys sizeof(physical_type) bytes from a string_view that comes straight out of the Thrift statistics block, so a crafted file with min_value or max_value shorter than the column's width (for example, a zero-byte stat on an INT96 column) reads past the buffer. The check belongs in this callee since callers in printer.cc just forward whatever the metadata holds. New ASSERT_THROW coverage in types_test.cc exercises the short-value paths for BOOLEAN, INT32, INT64, FLOAT, DOUBLE, INT96, Decimal and Float16.